Extracting Aggregate Answer Statistics for Integration
Abstract
Aggregate queries in integration contexts often do not have one “true” answer; there can be multiple correct answers for the same aggregate query. This is due to duplicate or overlapping data points, possibly with different values, across the data sources. Depending on which combination of data sources is used to answer the query, different answers can be generated. Thus, representing the answer to an aggregate query as an answer distribution instead of a single scalar value allows users to better understand the range of possible answers. This work provides a suite of methods for extracting statistics that convey meaningful information about aggregate query answers in heterogeneous integration settings. We focus on two challenges: (1) determining which statistics best represent an answer’s distribution; and (2) efficiently computing the desired statistics. Our solution includes the following answer statistics: (1) a set of point estimates with confidence intervals; (2) a high coverage interval that unveils “hot areas” in a distribution; and (3) a stability score that measures the impact of source dynamics. We optimize the extraction of this statistical information by minimizing the sampling load and applying fast approximate algorithms. We verify the effectiveness and efficiency of our methods with empirical studies using real-life and synthetic, scaled data sets.
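To make the idea of an answer distribution concrete, the following is a minimal Python sketch, not the authors' method. It assumes a toy setting in which each source is a dictionary from entity key to value, the aggregate is SUM, and a duplicate entity takes the value from the first chosen source that lists it. The function names (answer_for, sample_answers, mean_with_ci, shortest_coverage_interval), the conflict rule, the normal-approximation interval, and the toy data are all illustrative assumptions. The stability score is not sketched because the abstract does not define it.

```python
import math
import random
import statistics


def answer_for(sources, chosen):
    """SUM over the union of entities found in the chosen sources.

    Assumption: if an entity appears in several chosen sources, the value
    from the first chosen source that lists it is used.
    """
    merged = {}
    for idx in chosen:
        for key, value in sources[idx].items():
            merged.setdefault(key, value)
    return sum(merged.values())


def sample_answers(sources, n_samples, rng):
    """Compute answers for random non-empty, randomly ordered source subsets."""
    answers = []
    for _ in range(n_samples):
        k = rng.randint(1, len(sources))
        chosen = rng.sample(range(len(sources)), k)
        answers.append(answer_for(sources, chosen))
    return answers


def mean_with_ci(answers, z=1.96):
    """Sample mean with a normal-approximation confidence interval."""
    m = statistics.mean(answers)
    half = z * statistics.stdev(answers) / math.sqrt(len(answers))
    return m, (m - half, m + half)


def shortest_coverage_interval(answers, coverage=0.9):
    """Shortest interval holding at least `coverage` of the sampled answers."""
    xs = sorted(answers)
    width = max(1, math.ceil(coverage * len(xs)))
    best = min(range(len(xs) - width + 1),
               key=lambda i: xs[i + width - 1] - xs[i])
    return xs[best], xs[best + width - 1]


if __name__ == "__main__":
    # Hypothetical toy sources with overlapping, conflicting values.
    sources = [
        {"a": 10, "b": 20},
        {"b": 25, "c": 5},
        {"a": 12, "c": 5, "d": 8},
    ]
    rng = random.Random(0)
    answers = sample_answers(sources, 200, rng)
    print("mean and CI:", mean_with_ci(answers))
    print("90% coverage interval:", shortest_coverage_interval(answers))
```

Sampling subsets rather than enumerating them keeps the cost bounded as the number of sources grows, which is in the spirit of the sampling-based extraction the abstract mentions; the specific sampling scheme here is only one simple possibility.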
Similar resources
WORK IN PROGRESS: Data Explorer – Assessment Data Integration, Analytics, and Visualization for STEM Education Research
We describe a comprehensive system for comparative evaluation of uploaded and preprocessed data in physics education research with applicability to standardized assessments for discipline-based education research, especially in science, technology, mathematics, and engineering. Views are provided for inspection of aggregate statistics about student scores, comparison over time within one course,...
Presenting a Dynamic Method for Answering Ad Hoc Continuous Aggregate Queries
Data streams are unbounded, fast sequences of time-stamped data elements that arrive in bursts. Generally, these elements must be processed online and in real time, so algorithms that process data streams and answer queries over them are mostly one-pass. Executing such algorithms raises challenges such as memory limitations, scheduling, and accuracy of answers. They will be more ...
Scaling the walls of discovery: using semantic metadata for integrative problem solving
Current data integration approaches by bioinformaticians frequently involve extracting data from a wide variety of public and private data repositories, each with a unique vocabulary and schema, via scripts. These separate data sets must then be normalized through the tedious and lengthy process of resolving naming differences and collecting information into a single view. Attempts to consolida...
Strategic Human Resource Development Model Designing in National Iranian Oil Company
Today, human resource strategies are very important for human resource systems. When it comes to strategies, integration and coordination matter much more than strategy formulation and implementation. This study presents a model for the strategic development of human resources based on a competency model. In fact, the main question of this research is: What are the affec...
Extended aggregations for databases with referential integrity issues
Querying databases with incomplete or inconsistent content remains a broad and difficult problem. In this work, we study how to improve aggregations computed on databases with referential errors in the context of database integration, where each source database has different tables, columns with similar content across multiple databases, but different referential integrity constraints. Thus, a ...
Publication date: 2015